iT邦幫忙

2021 iThome Ironman Contest

AI & Data

Machine Learning and Front-End Web Pages series, part 24

Day 24: BERT Text Sentiment Classification, Part 3


Install the Simplified/Traditional Chinese conversion library:
pip install hanziconv

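Starting from yesterday's script that sorts the reviews into folders, convert each Simplified Chinese review to Traditional Chinese before writing it out.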

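import os

import pandas as pd
from hanziconv import HanziConv

# Load the hotel review corpus; the code below reads its "label" and "review" columns.
all_df = pd.read_csv("ChnSentiCorp_htl_all.csv")

# Shuffle all rows, then renumber them from 0.
shuffled = all_df.sample(frac=1).reset_index(drop=True)

# First 80% of the shuffled rows for training, the remaining 20% for testing.
train_df = shuffled.iloc[:int(len(shuffled)*0.8)]
test_df = shuffled.iloc[int(len(shuffled)*0.8):]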

# One sub-folder per split and per class (neg = label 0, pos = label 1).
mypaths = ["chinese/train/neg", "chinese/train/pos", "chinese/test/neg", "chinese/test/pos"]
for path in mypaths:
  os.makedirs(path, exist_ok=True)

# Write each training review to its own file, converted to Traditional Chinese.
for i, row in train_df.iterrows():
  if row["label"] == 1:
    with open("chinese/train/pos/" + str(i) + ".txt", "w", encoding="UTF-8") as f:
      f.write(HanziConv.toTraditional(str(row["review"])))
  if row["label"] == 0:
    with open("chinese/train/neg/" + str(i) + ".txt", "w", encoding="UTF-8") as f:
      f.write(HanziConv.toTraditional(str(row["review"])))


# Do the same for the test reviews.
for i, row in test_df.iterrows():
  if row["label"] == 1:
    with open("chinese/test/pos/" + str(i) + ".txt", "w", encoding="UTF-8") as f:
      f.write(HanziConv.toTraditional(str(row["review"])))
  if row["label"] == 0:
    with open("chinese/test/neg/" + str(i) + ".txt", "w", encoding="UTF-8") as f:
      f.write(HanziConv.toTraditional(str(row["review"])))
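
The two loops above differ only in the target folder, so they could be folded into one helper. The following is a minimal sketch, not the code used in this series: `write_split`, `LABEL_DIRS`, and the `convert` parameter are hypothetical names, and the helper takes plain `(label, review)` pairs instead of a DataFrame so that it runs without pandas or hanziconv installed.

```python
import os
import tempfile

# Hypothetical mapping from the CSV labels to the folder names used above.
LABEL_DIRS = {1: "pos", 0: "neg"}


def write_split(rows, root, split, convert=lambda text: text):
    """Write each (label, review) pair to root/split/<pos|neg>/<index>.txt.

    `convert` is applied to the text before writing; passing
    HanziConv.toTraditional would reproduce the conversion step above.
    Returns the number of files written.
    """
    written = 0
    for i, (label, review) in enumerate(rows):
        folder = LABEL_DIRS.get(label)
        if folder is None:
            continue  # skip rows whose label is neither 0 nor 1
        target = os.path.join(root, split, folder)
        os.makedirs(target, exist_ok=True)
        with open(os.path.join(target, f"{i}.txt"), "w", encoding="UTF-8") as f:
            f.write(convert(str(review)))
        written += 1
    return written


# Usage with two made-up reviews and a temporary output folder.
demo_root = tempfile.mkdtemp()
demo_rows = [(1, "房间很干净"), (0, "服务太差")]
demo_count = write_split(demo_rows, demo_root, "train")
```

With the DataFrames from the listing above, a call might look like `write_split(zip(train_df["label"], train_df["review"]), "chinese", "train", HanziConv.toTraditional)`. Note that this sketch numbers files by position within each split rather than by DataFrame index.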

Change the folder that tf.keras.preprocessing.text_dataset_from_directory reads from aclImdb to the chinese folder we just created.

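import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
# Use the same seed for the training and validation subsets so they do not overlap.
seed = 42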

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'chinese/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'chinese/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'chinese/test',
    batch_size=batch_size)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
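
Before training, it can help to confirm that the folders contain what the loader expects. text_dataset_from_directory infers class names from the sub-folder names in alphanumeric order, so `neg` becomes class 0 and `pos` becomes class 1, which matches the labels in the CSV. The sketch below uses only the standard library; `count_files_per_class` is a hypothetical helper, and the demo builds a throwaway folder instead of reading the real `chinese/train`.

```python
import os
import tempfile


def count_files_per_class(split_dir):
    """Return {class_name: number of .txt files}, with classes in sorted order."""
    counts = {}
    for name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for fn in os.listdir(class_dir) if fn.endswith(".txt")
            )
    return counts


# Demo on a temporary folder that mimics the layout of chinese/train.
demo_dir = tempfile.mkdtemp()
for cls, n in [("pos", 3), ("neg", 2)]:
    os.makedirs(os.path.join(demo_dir, cls))
    for k in range(n):
        with open(os.path.join(demo_dir, cls, f"{k}.txt"), "w", encoding="UTF-8") as f:
            f.write("sample")

demo_counts = count_files_per_class(demo_dir)
print(demo_counts)
```

Calling `count_files_per_class("chinese/train")` on the real data shows how balanced the two classes are, and the key order should match `class_names` from the listing above.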

Finally, switch the BERT model in use from en (English) to multi_cased (multilingual).
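
In code, the switch might look like the sketch below. This is an assumption, not taken from this article: the dictionary layout, the variable names `tfhub_handle_encoder` and `tfhub_handle_preprocess`, the choice of English model, and the TensorFlow Hub URLs are all modelled on TensorFlow's "Classify text with BERT" tutorial, so check the handles on TensorFlow Hub before relying on them.

```python
# Assumed TF Hub handles, modelled on TensorFlow's "Classify text with BERT"
# tutorial; verify them on TensorFlow Hub before use.
MAP_NAME_TO_HANDLE = {
    "bert_en_uncased_L-12_H-768_A-12":
        "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3",
    "bert_multi_cased_L-12_H-768_A-12":
        "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3",
}

# Each encoder must be paired with its matching preprocessing model.
MAP_MODEL_TO_PREPROCESS = {
    "bert_en_uncased_L-12_H-768_A-12":
        "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
    "bert_multi_cased_L-12_H-768_A-12":
        "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3",
}


def select_bert(name):
    """Return (encoder_handle, preprocess_handle) for a model name."""
    return MAP_NAME_TO_HANDLE[name], MAP_MODEL_TO_PREPROCESS[name]


# Choose the multilingual model instead of an English one.
bert_model_name = "bert_multi_cased_L-12_H-768_A-12"
tfhub_handle_encoder, tfhub_handle_preprocess = select_bert(bert_model_name)
print("encoder:   ", tfhub_handle_encoder)
print("preprocess:", tfhub_handle_preprocess)
```

Changing the encoder and the preprocessing model together matters because the multilingual model uses its own vocabulary; English-only preprocessing would not tokenize Chinese text the way the multilingual encoder expects.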

